We present a learning approach for localization and segmentation of objects in an 
image in a manner that is robust to partial occlusion. Our algorithm produces a bounding 
box around the full extent of the object and labels pixels in the interior that belong to the 
object. Like existing segmentation-aware detection approaches, we learn an appearance 
model of the object and consider regions that do not fit this model as potential occlusions. 
However, in addition to the established use of pairwise potentials for encouraging local 
consistency, we use higher-order potentials which capture information at the level of image 
segments. We also propose an efficient loss function that targets both localization and 
segmentation performance. Our algorithm achieves 13.52% segmentation error and 0.81 
area under the false-positive per image vs. recall curve on average over the challenging 
CMU Kitchen Occlusion Dataset. This is a 42.44% decrease in segmentation error and 
a 16.13% increase in localization performance compared to the state-of-the-art. Finally, 
we show that the visibility labelling produced by our algorithm can make full 3D pose 
estimation from a single image robust to occlusion. 


1 Introduction and Related Work 


In this paper we address the problem of localizing and segmenting partially occluded objects. 
We do this by generating a bounding box around the full extent of the objects, while also 
segmenting the visible parts inside the box. This is different from semantic segmentation, which 
typically does not provide information about the spatial position of labelled pixels inside the 
object. While a lot of progress has been made in object detection [0, O, IZD], occlusion by 
other objects still remains a challenge. A common theme is to model occlusion geometrically 
or appearance-wise, thereby allowing it to contribute to the detection process. Wang et al. 
[IE3] use a holistic Histogram of Oriented Gradients (HOG) template [0] to scan through the 
image and use specially trained part templates for instances where some cells of the holistic 
template respond poorly. Girshick et al. [O] force the Deformable Part Model detector to 
place a trained ‘occluder’ part in regions where the original parts respond weakly. The object 
masks produced by both of these algorithms are only accurate up to the parts and hence not 
usable for many applications, e.g. edge-based 3D pose estimation. Xiang and Savarese [IZ0] 
approximate object structure in 3D using planar parts. A Conditional Random Field (CRF) 
is then used to reason about visibility of the parts when the 3D planes are projected to the 
image. However, such methods work well only for large objects that can be approximated 
with planar parts. 

Our approach is entitled Segmentation and Detection using Higher-Order Potentials (SD-HOP). 
It is based on discriminatively learned HOG templates for objects and occlusion. 
Whereas the object templates model the objects of interest, the occlusion templates provide 
discriminative support and do not model a specific occluder. Segmentation is done by considering 
not only the response of patches to these templates, but also the segmentation of 
neighbouring patches through a CRF with higher-order connections that encompass image 
regions. 

We will compare our approach to two existing approaches that have been designed to 
handle partial occlusion. Hsiao and Hebert [O] approximate all occluders by boxes and generate 
occlusion hypotheses by finding locations of mismatch between image gradient and 
object model gradient. These hypotheses are then validated by the visibility of other points 
of the object and by an occlusion prior which assumes all objects rest on the same planar 
surface. Our algorithm does not need such assumptions, which reduce segmentation accuracy. 
Gao et al. [DU] learn discriminative appearance models of the object and occlusion 
seen during training. Segmentation is achieved by defining a CRF to assign binary labels 
to patches based on their response to these two filters. We build on their work but add several 
important modifications that lead to better localization and segmentation performance. 
Firstly, we replace the edge-based pairwise terms with 4-connected pairwise terms that are 
better able to propagate visibility relations. Secondly, we introduce the use of higher-order 
potentials defined over groups of patches, allowing us to reason at the level of image segments, 
which contain much more information than pairs of patches. We also introduce a 
new loss function for structured learning that targets both localization and segmentation performance 
but is still decomposable over the energy terms. Lastly, we introduce a simple 
procedure to convert the granular patch-level object mask produced by the algorithm to a 
fine pixel-level mask that can be used to make 3D pose estimation of detected objects robust 
to partial occlusion. Our algorithm outperforms these approaches (Hsiao and Hebert [O] 
and Gao et al. [DU]) at both object localization and segmentation on the CMU Kitchen 
Occlusion dataset, as shown in Section 3. 

The rest of the paper is organized as follows. Section 2 describes our proposed approach. 
We present evaluations on standard datasets and our own laboratory dataset in Section 3 and 
summarize in Section 4. 


2 Method 

The training phase for SD-HOP requires a set of images with different occlusions of the 
object(s) of interest. Each training sample is (1) over-segmented and (2) annotated with a 
bounding box around the full extent of the object and a binary segmentation of the area 
inside the box into object vs. non-object pixels. Given these training images and labels, 
we train a structured Support Vector Machine (SVM) that produces the HOG templates and 
CRF weights. Figure 1 shows an overview of our approach. 

Object segmentation is done by assigning binary labels to all HOG cells within the 
bounding box, 1 for visible and 0 for occluded. Instead of making independent decisions 
for every cell, we allow neighbouring cells to influence each other. Neighbour influence can 
take two forms: (1) pairwise terms (Rother et al. [D]) that impose a cost for 4-connected 
neighbours to have different labels and (2) higher-order potentials (Kohli et al. [D23]) that impose 
a cost for cells to have a different label than the dominant label in their segment of the 
image. These segments are produced separately by an unsupervised segmentation algorithm. 

Figure 1: Overview of our approach. Top: During training, images are segmented and features 
are extracted from pyramids of segmentations and HOG features. An SVM model is 
learned by max-margin learning. Bottom: After training, the model can be used to infer a 
bounding box and visible segments of the object. 

2.1 Notation 

The label for an object in an image $x$ is represented as $y = (p, v, a)$, where $p$ is the bounding 
box, $v$ is a vector of binary variables indicating the visibility of HOG cells within $p$ and 
$a \in [1, A]$ indexes the discrete viewpoint. $p = (p_x, p_y, p_\sigma)$ indicates the position of the top 
left corner and the level in a scale-space pyramid. The width and height of the box are 
fixed per viewpoint as $w_a$ and $h_a$ HOG cells respectively. Hence $v$ has $w_a \cdot h_a$ elements. 
All training images are also over-segmented to collect statistics for higher-order potentials. 
Any unsupervised algorithm can be used for this, e.g. Felzenszwalb and Huttenlocher [0] 
or Arbelaez et al. 

2.2 Feature Extraction 

Given an image $x$ and a labelling $y$, a sparse joint feature vector $\Psi(x, y)$ is formed by stacking 
A vectors. Each of these vectors has features for a different discretized viewpoint. All vectors 
except for the one corresponding to viewpoint a are zeroed out. Below, we describe the 
components of this vector. 

1. 31-dimensional HOG features are extracted for all cells of 8x8 pixels in $p$ as described 
in Felzenszwalb et al. [□]. The feature vector is constructed by stacking two groups 
which are formed by zeroing out different parts, similarly to Vedaldi and Zisserman 
[D3]. The visible group $\phi_v(x, p)$ has the HOG features zeroed out for cells labelled 0 
and the occlusion group $\phi_{nv}(x, p)$ has them zeroed out for cells labelled 1. 

2. Complemented visibility labels, to learn a prior for a cell to be labelled 0: $[\mathbf{1}_{wh} - v]$. 

3. Count $c(p)$ of cells in bounding box $p$ lying outside the image boundaries, to learn a 
cost for truncation by the image boundary, similarly to Vedaldi and Zisserman [D3]. 

4. Number of 4-connected neighbouring cells in the bounding box that have different 
labels, to learn a pairwise cost. 

5. Each segment in the bounding box obtained from unsupervised segmentation defines 
a clique of cells. To learn higher-order potentials, we need a vector $\theta_{hop}$ that captures 
the distribution of 0/1 label agreement within cliques. A vector $\theta_c \in \mathbb{R}^{K+1}$ is 
constructed for each clique $c$ as $(\theta_c)_k = 1$ if $\sum_{i \in c} v_i = k$ and 0 otherwise. The sum of all $\theta_c$ within $p$ 
gives $\theta_{hop}$. In practice, since cliques do not have the same size, we employ the normalization 
strategy described in Gould [O] and transform the statistics of all cliques to a 
standard clique size $K$ ($K = 4$ in our experiments). 

6. The constant 1, used to learn a bias term for different viewpoints. 
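
The feature groups in items 1, 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, HOG features are taken as one list of floats per cell, and the rounding used to map a clique of arbitrary size onto the standard size K is a crude stand-in for the normalization of Gould, which is not reproduced here.

```python
from typing import Dict, List, Tuple


def split_hog(features: List[List[float]], v: List[int]) -> Tuple[list, list]:
    """Item 1: visible group keeps features of cells labelled 1,
    occlusion group keeps features of cells labelled 0; the rest is zeroed."""
    phi_v = [list(f) if lab == 1 else [0.0] * len(f) for f, lab in zip(features, v)]
    phi_nv = [list(f) if lab == 0 else [0.0] * len(f) for f, lab in zip(features, v)]
    return phi_v, phi_nv


def pairwise_disagreements(v: List[List[int]]) -> int:
    """Item 4: number of 4-connected neighbouring cell pairs with different labels."""
    h, w = len(v), len(v[0])
    count = 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w and v[r][c] != v[r][c + 1]:  # right neighbour
                count += 1
            if r + 1 < h and v[r][c] != v[r + 1][c]:  # bottom neighbour
                count += 1
    return count


def clique_statistics(v: List[List[int]], segments: List[List[int]], K: int) -> List[int]:
    """Item 5: histogram over cliques of the (rescaled) number of visible cells.

    ASSUMPTION: a clique of size M with n visible cells is assigned to bin
    round(K * n / M); this rounding is illustrative, not the paper's scheme.
    """
    size: Dict[int, int] = {}
    ones: Dict[int, int] = {}
    for v_row, s_row in zip(v, segments):
        for lab, seg in zip(v_row, s_row):
            size[seg] = size.get(seg, 0) + 1
            ones[seg] = ones.get(seg, 0) + lab
    theta_hop = [0] * (K + 1)
    for seg, m in size.items():
        theta_hop[int(K * ones[seg] / m + 0.5)] += 1
    return theta_hop


labels = [[1, 1, 0],
          [1, 0, 0]]
segment_ids = [[0, 0, 1],
               [0, 1, 1]]
print(pairwise_disagreements(labels))                 # pairwise feature
print(clique_statistics(labels, segment_ids, K=4))    # higher-order statistics
```

Both cliques in the toy example are fully consistent (all visible or all occluded), so all the mass of the histogram sits in its first and last bins.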

2.3 Learning 

Suppose $w$ is a vector of weights for elements of the joint feature vector. We define $w^T \Psi(x, y)$ 
as the ‘energy’ of the labelling $y$. The aim of learning is to find $w$ such that the energy of the 
correct label is minimum. Hence we define the label predicted by the algorithm as 

$$f(x; w) = y^* = \arg\min_{y} w^T \Psi(x, y) \tag{1}$$

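We use a labelled dataset $\{(x_i, y_i)\}_{i=1}^{N}$ and learn $w$ by solving the following constrained Quadratic 
Program (QP) 

$$
\begin{aligned}
\min_{w,\, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & w^T \left( \Psi(x_i, \hat{y}_i) - \Psi(x_i, y_i) \right) + \xi_i \geq \Delta(y_i, \hat{y}_i) \quad \forall i,\ \hat{y}_i \in \mathcal{Y}_i \\
& \xi_i \geq 0 \quad \forall i \\
& D_2 w \geq 0
\end{aligned}
\tag{2}
$$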

Intuitively this formulation requires that the score $w^T \Psi(x_i, y_i)$ of any ground truth labelled 
image $x_i$ must be smaller than the score $w^T \Psi(x_i, \hat{y}_i)$ of any other labelling $\hat{y}_i$ by the distance 
between the two labellings $\Delta(y_i, \hat{y}_i)$ minus the slack variable $\xi_i$, where $\|w\|^2$ and $\xi_i$ are 
minimized. The regularization constant $C$ adjusts the importance of minimizing the slack 
variables. The above formulation has exponentially many constraints for each training image. For 
tractability, training is performed by using the cutting plane training algorithm of Joachims 
et al. [O] which maintains a working set $\mathcal{Y}_i$ of most violated constraints (MVCs) for each 
image. Gould [O] adapts this algorithm for training higher-order potentials. It uses $D_2$ as a 
second order curvature constraint on the $K + 1$ weights for the higher-order potentials, which 
forces them to make a concave lower envelope. This encourages most cells in the image 
segments to agree in visibility labelling. $D_2$ is an appropriately 0-padded (to the left and 
right) version of 

$$
\begin{bmatrix}
-1 & 2 & -1 & 0 & \cdots \\
 & & \ddots & & \\
0 & \cdots & -1 & 2 & -1
\end{bmatrix}
$$
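
To make the curvature constraint concrete, the following sketch builds the second-difference matrix for the K + 1 higher-order weights alone (without the zero padding for the remaining entries of the full weight vector) and checks the constraint on two example weight vectors. The function names and the example weights are illustrative assumptions, not values from the paper.

```python
from typing import List


def second_difference_matrix(K: int) -> List[List[float]]:
    """K - 1 rows, each holding the stencil (-1, 2, -1) over K + 1 weights."""
    rows = []
    for k in range(1, K):
        row = [0.0] * (K + 1)
        row[k - 1], row[k], row[k + 1] = -1.0, 2.0, -1.0
        rows.append(row)
    return rows


def satisfies_curvature(D2: List[List[float]], w_hop: List[float], tol: float = 1e-12) -> bool:
    """True if every row of D2 * w_hop is non-negative, i.e. w_hop is concave in k."""
    return all(sum(d * w for d, w in zip(row, w_hop)) >= -tol for row in D2)


D2 = second_difference_matrix(4)
concave = [0.0, 1.0, 1.5, 1.0, 0.0]   # cheap when the clique agrees, costly when mixed
convex = [1.0, 0.0, 0.0, 0.0, 1.0]    # would reward mixed cliques
print(satisfies_curvature(D2, concave), satisfies_curvature(D2, convex))
```

A weight vector that passes the check is low at both ends (all cells occluded or all visible) and higher in between, which is what penalizes disagreement inside a segment.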

The distance $\Delta(y, \hat{y})$ between two labels $y$ and $\hat{y}$ is called the loss function. It depends on the amount 
of overlap between the two bounding boxes and the Hamming distance between the visibility 
labellings 

$$\Delta(y, \hat{y}) = \left(1 - \frac{\mathrm{area}(p \cap \hat{p})}{\mathrm{area}(p \cup \hat{p})}\right) + \frac{\mathrm{area}(p \cap \hat{p})}{\mathrm{area}(p \cup \hat{p})}\, H(v, \hat{v}) \tag{3}$$

The mean Hamming distance $H(v, \hat{v})$ between two labellings $v$ and $\hat{v}$ (potentially having different 
sizes as they might belong to different viewpoints) is calculated after projecting them 
to the lowest level of the pyramid. By construction of the loss function, the difference in segmentation 
starts contributing to the loss only after the two bounding boxes start overlapping 
each other. It also has the nice property of decomposing over the energy terms, as described 
in Section 2.4.1. 
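
A minimal sketch of such a loss is given below, assuming it has the form (1 - overlap) + overlap * H, where overlap is the intersection-over-union of the two boxes and H is the mean Hamming distance. Boxes are assumed to be (x0, y0, x1, y1) tuples and both labellings are assumed to have the same length, so the projection to the lowest pyramid level is skipped. All names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed convention


def box_overlap(p: Box, q: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    iw = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    ih = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = iw * ih
    union = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_hamming(v: Sequence[int], v_hat: Sequence[int]) -> float:
    """Fraction of cells whose visibility labels differ (equal lengths assumed)."""
    return sum(a != b for a, b in zip(v, v_hat)) / len(v)


def loss(p: Box, v: Sequence[int], p_hat: Box, v_hat: Sequence[int]) -> float:
    o = box_overlap(p, p_hat)
    # Localization term plus a segmentation term weighted by the overlap, so
    # segmentation differences only matter once the boxes overlap.
    return (1.0 - o) + o * mean_hamming(v, v_hat)


print(loss((0, 0, 2, 2), [1, 1, 0, 0], (1, 0, 3, 2), [1, 0, 0, 0]))
print(loss((0, 0, 2, 2), [1, 1, 0, 0], (5, 5, 7, 7), [0, 0, 1, 1]))
```

In the second call the boxes are disjoint, so the loss saturates at 1 regardless of how different the two visibility labellings are.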


2.4 Inference 


To perform the inference as described in Eq. 1 we have to search through $\mathcal{Y} = \mathcal{A} \times \mathcal{P} \times \mathcal{V}$, 
where $\mathcal{A}$ is the set of viewpoints, $\mathcal{P}$ is the set of all pyramid locations and $\mathcal{V}$ is the exponential 
set of all combinations of visibility variables. We enumerate over $\mathcal{A}$ and $\mathcal{P}$ and use an $s$-$t$ 
mincut to search over $\mathcal{V}$ at every location. 

By construction, the weight vector $w$ can be decomposed into weight vectors for the 
different viewpoints, i.e. $w = [w^1, w^2, \ldots, w^A]$. In the following description, we will consider 
one viewpoint and omit the superscript for brevity of notation. $w$ can also be decomposed 
as $[w_v, w_{nv}, w_{pr}, w_{trunc}, w_{pw}, w_{hop}, w_{bias}]$ into the components described in Section 2.2. We 
define the following terms that are used to construct the graph shown in Figure 2(b). $\phi_i(x, p)$ 
are the vectorized HOG features extracted at cell $i$ in bounding box $p$. Unary terms 
$A_i(p) = w_{v,i}^T \phi_i(x, p)$ and $B_i(p) = w_{nv,i}^T \phi_i(x, p)$ are the responses at cell $i$ for object and occlusion 
filters respectively. $R_i = w_{pr,i}$ is the prior for cell $i$ to be labelled 0. Constant term 
$C(y) = w_{trunc} \cdot c(p) + w_{bias}$ is the sum of image boundary truncation cost and bias. $\mathcal{E}$ is the set of 
4-connected neighbouring cells in $p$ and $w_{pw}$ is the pairwise weight. $\mathcal{C}(p)$ is the set of all 
cliques in $p$ and $\psi_c(v_c)$ is the higher-order potential for clique $c$ having nodes with visibility 
labels $v_c$. Combining these terms, the energy for a particular labelling is formulated as 


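$$
\begin{aligned}
E(x, y) = w^T \Psi(x, y) = {} & \sum_{i=1}^{wh} \left[ A_i(p)\, v_i + B_i(p)\,(1 - v_i) + R_i\,(1 - v_i) \right] \\
& + \sum_{(i,j) \in \mathcal{E}} w_{pw}\, |v_i - v_j| + \sum_{c \in \mathcal{C}(p)} \psi_c(v_c) + C(y)
\end{aligned}
\tag{4}
$$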


$\psi_c(v_c)$, the higher-order potential for clique $c$, is defined as $\min_{k = 1, \ldots, K} \left( s_k \sum_{i \in c} v_i + b_k \right)$, following 
Gould [O]. Intuitively, it is the lower envelope of a set of lines whose slope is defined 
as $s_k = \frac{K}{M} \left( (w_{hop})_k - (w_{hop})_{k-1} \right)$ and intercept as $b_k = (w_{hop})_k - s_k k$ (recall that $w_{hop}$ is a 
$K + 1$ dimensional weight vector). $M$ is the size of the clique. The normalization in $s_k$ makes 
the potential invariant to the size of the clique (refer to Gould [O] for details). Figure 2(a) 
shows a sample higher-order potential curve for a clique of $K$ cells. 
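
The lower-envelope idea can be illustrated with a short sketch. It evaluates the envelope of the K lines through consecutive weights at the number of visible cells rescaled to the standard clique size K, which is one reading of the size normalization and should be treated as an assumption; for a clique of exactly K cells it coincides with the slope and intercept form above. The weights used are made up for illustration.

```python
from typing import List, Sequence


def higher_order_potential(w_hop: List[float], clique_labels: Sequence[int]) -> float:
    """Lower envelope of K lines, evaluated at the rescaled count of visible cells."""
    K = len(w_hop) - 1
    M = len(clique_labels)
    t = K * sum(clique_labels) / M          # visible count mapped onto [0, K]
    values = []
    for k in range(1, K + 1):
        slope = w_hop[k] - w_hop[k - 1]     # line through (k - 1, w[k-1]) and (k, w[k])
        values.append(slope * (t - k) + w_hop[k])
    return min(values)


w_hop = [0.0, 1.0, 1.5, 1.0, 0.0]           # concave, K = 4 (illustrative values)
print(higher_order_potential(w_hop, [1, 1, 1, 1]))              # all visible
print(higher_order_potential(w_hop, [1, 1, 0, 0]))              # evenly split
print(higher_order_potential(w_hop, [1, 1, 1, 1, 0, 0, 0, 0]))  # larger clique, same ratio
```

Because the weights are concave, the envelope passes through them: a clique that agrees completely pays the end-point cost, while an evenly split clique pays the highest cost, independent of its size.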

Given an image, a location, and a viewpoint we use $s$-$t$ mincut on the graph construction 
shown in Figure 2(b) to find the labelling $v$ that minimizes the energy in Eq. 4. Each 
variable $v_i$, $i \in \{1, 2, \ldots, wh\}$ defines a node and each clique has $K - 1$ auxiliary nodes in 
the graph, $z_1, \ldots, z_{K-1}$. For a detailed derivation of this graph structure please see Boykov 
and Kolmogorov [0] and Gould [O]. After the maxflow algorithm finishes, the nodes $v_i$ still 
connected to $s$ are labelled 1 and others are labelled 0. 
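
The sketch below shows how such a labelling can be obtained by maxflow for the unary and pairwise part of the energy only; the auxiliary nodes for the higher-order cliques are deliberately left out. It uses a plain Edmonds-Karp maxflow written for readability rather than the specialised solver cited above, and all names (`min_cut_labels`, `cost1`, `cost0`) are hypothetical. `cost1[i]` stands for the cost of labelling cell i as 1 and `cost0[i]` for the cost of labelling it 0.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def energy(v, cost1, cost0, edges, w_pw):
    unary = sum(cost1[i] if v[i] else cost0[i] for i in range(len(v)))
    return unary + w_pw * sum(abs(v[i] - v[j]) for i, j in edges)


def min_cut_labels(cost1: Sequence[float], cost0: Sequence[float],
                   edges: Sequence[Tuple[int, int]], w_pw: float) -> List[int]:
    """Minimise the unary + pairwise energy with an s-t mincut (requires w_pw >= 0)."""
    n = len(cost1)
    s, t = n, n + 1
    cap: List[Dict[int, float]] = [dict() for _ in range(n + 2)]

    def add_edge(u: int, v: int, c: float) -> None:
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)            # residual edge

    for i in range(n):
        m = min(cost1[i], cost0[i])          # shift so capacities are non-negative
        add_edge(s, i, cost0[i] - m)         # cut if i ends on the sink side (label 0)
        add_edge(i, t, cost1[i] - m)         # cut if i stays on the source side (label 1)
    for i, j in edges:
        add_edge(i, j, w_pw)
        add_edge(j, i, w_pw)

    while True:
        parent = {s: None}                   # breadth-first search in the residual graph
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                            # no augmenting path left: flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow
    # Nodes still reachable from s are labelled 1, the others 0.
    return [1 if i in parent else 0 for i in range(n)]


grid_edges = [(0, 1), (2, 3), (0, 2), (1, 3)]     # 2x2 grid of cells
c1 = [-2.0, -1.0, 0.5, 1.0]
c0 = [0.0, 0.0, 0.0, -0.5]
print(min_cut_labels(c1, c0, grid_edges, 0.75))
```

On a grid this small the result can be checked against exhaustive enumeration of all 16 labellings, which is a convenient sanity test before moving to a full-size bounding box.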

Figure 2: (a) Concave higher-order potentials encouraging cells in a clique to have the same 
binary label. (b) Construction of the graph to compute the energy-minimizing binary labelling 
of cells by $s$-$t$ mincut. 


Algorithm 1 Response-transfer between object detectors in overlapping regions 
for all $o \in [1, L]$ do {$L$ is the number of objects, $\circ$ denotes the Hadamard product} 
    for all $p \in \mathcal{P}$ do 
        $B(p) = C(p) \circ \mathbb{1}[V(p) \neq 0]$ {Transfer equation for all cells in $p$} 
    end for 
    $y^* = \arg\min_{y} w^T \Psi(x, y)$ 
    $C(y^*(p)) = A(y^*(p)) \circ y^*(v)$ {Update equations for all cells in $p$} 
    $V(y^*(p)) = o \cdot y^*(v)$ 
end for 
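
The bookkeeping in Algorithm 1 can be sketched as follows. This is an interpretation rather than a transcription: maps are flat Python lists indexed by cell, cells not yet claimed by any object keep their generic occlusion response, and only cells marked visible are written back into the label map and the collected response map. The function names are hypothetical.

```python
from typing import List, Sequence


def transfer_responses(B: List[float], C: Sequence[float], V: Sequence[int],
                       cells: Sequence[int]) -> List[float]:
    """Transfer step: for cells already claimed by an earlier object (V != 0),
    replace the occlusion filter response by that object's collected response."""
    for i in cells:
        if V[i] != 0:
            B[i] = C[i]
    return B


def update_maps(C: List[float], V: List[int], A: Sequence[float],
                cells: Sequence[int], visible: Sequence[int], obj_id: int) -> None:
    """Update step: cells the current detection marks visible store its object
    filter response in C and its object label in V (ASSUMPTION: other cells
    are left untouched)."""
    for i, vis in zip(cells, visible):
        if vis:
            C[i] = A[i]
            V[i] = obj_id


# Four cells in a row; object 1 is detected first and claims cells 0 and 1.
C_map = [0.0, 0.0, 0.0, 0.0]
V_map = [0, 0, 0, 0]
A_obj1 = [-1.0, -2.0, 0.5, 0.3]
update_maps(C_map, V_map, A_obj1, cells=[0, 1, 2, 3], visible=[1, 1, 0, 0], obj_id=1)

# Object 2 is searched next at a location covering cells 1..3.
B_obj2 = [0.1, 0.1, 0.1, 0.1]
transfer_responses(B_obj2, C_map, V_map, cells=[1, 2, 3])
print(B_obj2)
```

After the transfer, the second detector compares its own object response in cell 1 against the first object's response instead of the generic occlusion response.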


2.4.1 Loss-augmented Inference 

Loss-augmented inference is an important part of the cutting plane training algorithm (‘separation 
oracle’ in Joachims et al. [O]) and is used to find the most violated constraints. It 
is defined as $\hat{y}_{MVC} = \arg\min_{\hat{y}} w^T \Psi(x, \hat{y}) - \Delta(y, \hat{y})$, where $y$ is the ground-truth labelling. 
Our formulation of the loss function makes it solvable with the same complexity as normal 
inference (Eq. 1) by decomposing the loss over the terms in Eq. 4. The first term of Eq. 3 is 
added to $C(y)$, while the second term is distributed across $A_i(p)$ and $B_i(p)$ in Eq. 4. 
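
The decomposition can be illustrated for a single candidate location, where the box overlap is a fixed number. The sketch assumes a loss of the form (1 - overlap) + overlap * (mean Hamming distance), a ground-truth labelling already aligned with the cells of the candidate box, and only unary and constant energy terms. The function name and argument layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def loss_augmented_terms(A: Sequence[float], B: Sequence[float], const: float,
                         gt: Sequence[int], overlap: float
                         ) -> Tuple[List[float], List[float], float]:
    """Fold the negated loss into the unary and constant terms of the energy.

    A[i] is paid when cell i is labelled 1, B[i] when it is labelled 0.
    """
    n = len(gt)
    # Labelling a ground-truth-occluded cell as visible earns overlap / n of loss.
    A_aug = [a - overlap / n if g == 0 else a for a, g in zip(A, gt)]
    # Labelling a ground-truth-visible cell as occluded earns overlap / n of loss.
    B_aug = [b - overlap / n if g == 1 else b for b, g in zip(B, gt)]
    # The localization part of the loss does not depend on the visibility labels.
    return A_aug, B_aug, const - (1.0 - overlap)


A_u = [0.2, -1.0, 0.7]
B_u = [0.0, 0.3, -0.4]
A_aug, B_aug, c_aug = loss_augmented_terms(A_u, B_u, 0.5, gt=[1, 0, 1], overlap=0.6)
print(A_aug, B_aug, c_aug)
```

With the augmented terms, the same mincut used for ordinary inference returns the most violated labelling at that location.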

2.5 Detection of Multiple Objects 

Multiple objects of interest might overlap. Running the individual object detectors separately 
leaves regions of ambiguity in overlapping areas if multiple detectors mark the same location 
as visible. We find that running one iteration of α-expansion (see Boykov et al. [i]) in 
overlapping areas resolves ambiguities coherently. The detectors are run sequentially. We 
maintain a label map $V$ that stores for each cell the label of the object that last marked it 
visible, and a collected response map $C$ that stores for each cell the object filter response 
($A(p)$) from the object that last marked it visible. While running the location search for 
object $o$, we transfer object filter responses from $C$ to the occlusion filter response map 
($B(p)$) for the current object as described in Algorithm 1. This is effectively one iteration 
of α-expansion (see supplementary material for details). It causes decisions in overlapping 
regions to be made between responses of well-defined object filters rather than between 
responses of an object filter and a generic occlusion filter. 

Such response-transfer requires the object models to be compatible with each other. We 
achieve this by training the object models together as if they were different viewpoint components 
of the same object. The bias term in the feature vector makes the filter responses of 
different components comparable. 


2.6 3D Pose Estimation 

The basic principle of many model-based 3D pose estimation algorithms is to fit a given 
3D model of the object to its corresponding edges in the image. For example, in Choi and 
Christensen [B], the 3D CAD model is projected into the image and correspondences between 
the projected model edges and image edges are set up. The pose is estimated by solving 
an Iterative Re-weighted Least Squares (IRLS) problem. However, partial occlusion causes 
these approaches to fail by introducing new edges. We make the algorithm robust to partial 
occlusion by first identifying visible pixels of the object using SD-HOP and discarding 
correspondences outside the visibility mask. We call our extension of the algorithm Occlusion 
Reasoning-IRLS (OR-IRLS). 
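
The occlusion-reasoning step amounts to a filter placed in front of the IRLS solver. A minimal sketch follows, assuming correspondences are given as pairs of a model edge identifier and an image point (x, y) in pixels, and the visibility mask is a row-major grid of 0/1 values. The data layout and the function name are illustrative and not taken from the pose estimation code.

```python
from typing import Any, List, Sequence, Tuple

Correspondence = Tuple[Any, Tuple[float, float]]  # (model edge id, image point (x, y))


def filter_correspondences(correspondences: Sequence[Correspondence],
                           mask: Sequence[Sequence[int]]) -> List[Correspondence]:
    """Keep only correspondences whose image point falls on a visible pixel."""
    height, width = len(mask), len(mask[0])
    kept = []
    for model_pt, (x, y) in correspondences:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width and mask[row][col]:
            kept.append((model_pt, (x, y)))
    return kept


visibility = [[1, 1, 0],
              [0, 1, 0]]
matches = [("e0", (0.2, 0.1)),   # on a visible pixel
           ("e1", (2.0, 0.0)),   # on an occluded pixel
           ("e2", (1.1, 0.9)),   # on a visible pixel
           ("e3", (5.0, 1.0))]   # outside the mask
print(filter_correspondences(matches, visibility))
```

Only the surviving correspondences would then be passed on to the least-squares pose update.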


3 Evaluation 

We implemented SD-HOP in Matlab, with MVC search and inference implemented in CUDA 
since they are massively parallel problems. Inference on a 640x480 image with 11 scales 
takes 3 s for a single object with a single viewpoint on our 3.4 GHz CPU and NVIDIA GT-730 
GPU. 

3.1 Localization and Segmentation 

We evaluated our approach on the CMU Kitchen Occlusion Dataset from Hsiao and Hebert [O]. 
This dataset was chosen because (1) it provides extensive labelled training data in the 
form of images with bounding boxes and object masks, and (2) the dataset is challenging 
and offers the opportunity to compare against an algorithm designed specifically to handle 
occlusion. For the localization task we generated false positives per image (FPPI) vs. recall 
curves, while for the segmentation task we measured the mean segmentation error against 
ground truth as defined by the Pascal VOC segmentation challenge in Everingham et al. [□]. 
$C = 25$ (see Eq. 2) was chosen by 5-fold cross-validation. While both results are presented 
for the single pose part of the dataset, multiple poses are easily handled in our algorithm as 
different components of the feature vector. Figure 3 shows FPPI vs. recall curves compared 
with those reported by the rLINE2d+OCLP algorithm of Hsiao and Hebert [Q] and those 
generated from our implementation of Gao et al. [DU]. Table 1 presents segmentation errors 
compared with Gao et al. [DU]. Hsiao and Hebert [101] do not report a segmentation of the 
object. 

Figure 3: Object localization results on the CMU Kitchen Occlusion dataset (FPPI vs. recall). 
Panels: Bakingpan, Colander, Thermos, Pitcher. 

Table 1: Mean object segmentation error 

Object       Gao et al. [DU]   SD-HOP 
Bakingpan    0.2904            0.1516 
Colander     0.2095            0.1249 
Cup          0.2144            0.1430 
Pitcher      0.2499            0.1131 
Saucepan     0.1956            0.1103 
Scissors     0.2391            0.1649 
Shaker       0.2654            0.1453 
Thermos      0.2271            0.1285 

Table 2: Mean 3D pose estimation error 

Pose parameter     IRLS      OR-IRLS 
X (cm)             1.6874    0.5774 
Y (cm)             1.4953    0.6516 
Z (cm)             8.228     2.1506 
Roll (degrees)     1.1711    0.7152 
Pitch (degrees)    7.9100    2.3191 
Yaw (degrees)      5.7712    2.6055 

Figure 3 shows that while both SD-HOP and Gao et al. [DU] have similar recall at 1.0 
FPPI, SD-HOP consistently performs better in terms of area under the curve (AUC). Averaged 
over the 8 objects, SD-HOP achieves 16.13% more AUC than Gao et al. [DU]. Table 1 
shows that SD-HOP consistently outperforms Gao et al. [DU] in terms of segmentation error, 
achieving 42.44% less segmentation error averaged over the 8 objects. Figure 5 shows 
examples of the algorithm’s output on various images from the CMU Kitchen Occlusion 
dataset. 
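
The averaged segmentation figures can be reproduced from the per-object errors in Table 1. The short check below assumes that the quoted decrease is the mean of the per-object relative reductions, which is the reading under which the numbers agree; the variable names are arbitrary.

```python
# Per-object mean segmentation errors copied from Table 1.
gao = {"Bakingpan": 0.2904, "Colander": 0.2095, "Cup": 0.2144, "Pitcher": 0.2499,
       "Saucepan": 0.1956, "Scissors": 0.2391, "Shaker": 0.2654, "Thermos": 0.2271}
sd_hop = {"Bakingpan": 0.1516, "Colander": 0.1249, "Cup": 0.1430, "Pitcher": 0.1131,
          "Saucepan": 0.1103, "Scissors": 0.1649, "Shaker": 0.1453, "Thermos": 0.1285}

# Mean SD-HOP error over the 8 objects.
mean_error = sum(sd_hop.values()) / len(sd_hop)

# Mean of the per-object relative reductions with respect to Gao et al.
mean_reduction = sum((gao[k] - sd_hop[k]) / gao[k] for k in gao) / len(gao)

print(f"mean SD-HOP error: {100 * mean_error:.2f}%")
print(f"mean relative reduction: {100 * mean_reduction:.2f}%")
```

The two printed values correspond to the 13.52% segmentation error and the 42.44% decrease quoted in the text.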

3.2 Ablation Study 

We conducted an ablation study on the ‘pitcher’ object of the CMU Kitchen Occlusion 
dataset to determine the individual effect of our contributions. Using the loss function 
from Gao et al. [DU] caused the segmentation error to increase from 0.1131 to 0.1547 and 
area under curve (AUC) of FPPI vs. recall to drop from 0.7877 to 0.7071. To discern the 
effect of 4-connected pairwise terms we removed the higher order terms from the model too. 
Using the pairwise terms as described in Gao et al. [DU] caused the segmentation error to 
increase from 0.1547 to 0.2499 and AUC to decrease from 0.7071 to 0.6414. 

Lastly, to quantify the effect of higher-order potentials, we compared the full SD-HOP 
model against one with higher-order potentials removed. Removing higher-order potentials 
caused the segmentation error to increase from 0.1131 to 0.1430 and AUC to drop from 
0.7877 to 0.7544. We hypothesize that for small objects like the ones in the CMU Kitchen 
Occlusion dataset, 4-connected pairwise terms are almost as informative as higher-order 
terms. To check this hypothesis we tested the effect of removing higher-order potentials on 
a close-up dataset of 41 images of a pasta box occluded to varying degrees by various 
household objects. Removing the higher-order potentials caused the segmentation error to 
increase from 0.1308 to 0.1516 and AUC to drop from 0.9546 to 0.9008. 
This indicates that higher-order terms are more useful for objects with larger and hence more 
informative segments. 

Figure 4: 3D pose estimation. Left to right: Pose estimation with IRLS, SD-HOP raw 
segmentation mask, SD-HOP refined segmentation mask, Pose estimation with OR-IRLS. 
Best viewed in colour. 


3.3 3D Pose Estimation 

We collected 3D pose estimation results produced by IRLS and OR-IRLS on a dataset which 
has 17 images of a car door in an indoor environment. The ground truth pose for the car door 
was obtained by an ALVAR marker [□]. Table 2 shows the mean errors in the six pose 
parameters. To separate the effect of errors inherent in the pose estimation process from the 
effect of occlusion reasoning, the pose of the car door was constant throughout the dataset, 
with various partial occlusions being introduced. 

The granular HOG cell-level mask produced by SD-HOP caused some important silhouette 
edges to be missed for pose estimation. To solve this problem we utilized the unsupervised 
segmentation done earlier for defining higher-order terms. If more than 80% of the 
area within a segment was marked 1, we marked the whole segment with 1. Since segments 
follow object boundaries, this produced much cleaner masks for pose estimation. Figure 4 
shows the masks and pose estimation results for an example image from the dataset, with 
more such examples presented in the supplementary material. Note that the segmentation 
errors mentioned in Table 1 use the raw masks. 
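
The refinement rule can be sketched as follows, assuming the raw cell-level mask has already been upsampled to pixel resolution and the unsupervised segmentation is available as a map of segment ids of the same size. Pixels in segments that do not reach the threshold are left as they were, which is an assumption since the text only specifies the fill rule. The function name is hypothetical.

```python
from typing import Dict, List, Sequence


def refine_mask(raw_mask: Sequence[Sequence[int]],
                segments: Sequence[Sequence[int]],
                threshold: float = 0.8) -> List[List[int]]:
    """Mark a whole segment as visible if more than `threshold` of its area is visible."""
    area: Dict[int, int] = {}
    ones: Dict[int, int] = {}
    for m_row, s_row in zip(raw_mask, segments):
        for m, s in zip(m_row, s_row):
            area[s] = area.get(s, 0) + 1
            ones[s] = ones.get(s, 0) + m
    filled = {s for s in area if ones[s] > threshold * area[s]}
    # ASSUMPTION: pixels of segments below the threshold keep their raw label.
    return [[1 if s in filled else m for m, s in zip(m_row, s_row)]
            for m_row, s_row in zip(raw_mask, segments)]


raw = [[1, 1, 1, 0, 0],
       [1, 1, 1, 0, 1]]
seg = [[0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1]]
print(refine_mask(raw, seg))
```

In the toy example, segment 0 is 6/7 visible and is therefore filled completely, while segment 1 stays unchanged.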


4 Conclusion 

We presented an algorithm (SD-HOP) that localizes partially occluded objects robustly and 
segments their visible regions accurately. In contrast to previous approaches that model occlusion, 
our algorithm uses higher-order potentials to reason at the level of image segments 
and employs a loss function that targets both localization and segmentation performance. We 
demonstrated that our algorithm outperforms existing approaches on both tasks, when evaluated 
on a challenging dataset. Finally, we have shown that the segmentation output from 
SD-HOP can be used to improve pose estimation performance in the presence of occlusion. 
Avenues of future research include (1) training from weakly labelled data, i.e. without segmentations, 
(2) a post-training algorithm to make object models comparable without having 
to train them together, and (3) using the occlusion information to reason about interactions 
between objects in scene understanding applications. 

Figure 5: Object localization and segmentation results on the CMU Kitchen Occlusion 
dataset. Left: Image, Center: Raw mask from SD-HOP, Right: Refined mask from SD-HOP 

We would like to acknowledge Ana Huaman Quispe’s help with implementing this system 
on a bimanual robot. The system was used to enable the robot to pick up partially visible 
objects lying on a table. 